Random Sampling from Databases
Author: Frank Olken
Abstract
Random Sampling from Databases, by Frank Olken. Doctor of Philosophy in Computer Science, University of California at Berkeley. Professor Michael Stonebraker, Chair.
In this thesis I describe efficient methods of answering random sampling queries of relational databases, i.e., retrieving random samples of the results of relational queries. I begin with a discussion of the motivation for including sampling operators in the database management system (DBMS). Uses include auditing, estimation (e.g., approximate answers to aggregate queries), and query optimization. The second chapter contains a review of the basic file sampling methods used in the thesis: acceptance/rejection sampling, reservoir sampling, and partial sum (ranked) tree sampling. I describe their usage for sampling from variably blocked files, and sampling from results as they are generated. Related literature on sampling from databases is reviewed. In Chapter Three I show how acceptance/rejection sampling of B+ trees can be employed to obtain simple random samples of B+ tree files without auxiliary data structures. Iterative and batch algorithms are described and evaluated. The fourth chapter covers sampling from hash files: open addressing hash files, separately chained overflow hash files, linear hash files, and extendible hash files. I describe both iterative and batch algorithms, and characterize their performance. I describe and analyze algorithms for sampling from relational operators in Chapter Five: selection, intersection, union, projection, set difference, and join. Methods of sampling from complex relational expressions, including select-project-join queries, are also described. In Chapter Six I describe the maintenance of materialized sample views. Here I combine sampling techniques with methods of maintaining conventional materialized views. I consider views defined by simple queries consisting of single relational operators. The penultimate chapter covers sampling from spatial databases. I develop algorithms for obtaining uniformly distributed samples of points which satisfy a spatial predicate represented as a union of polygons in the database. Sampling algorithms from both quadtrees and R-trees are described, including spatial reservoir sampling algorithms. I conclude with a summary of the thesis and an agenda for future work.
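Two of the basic methods named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's own code: the function names, the in-memory list-of-lists stand-in for a variably blocked file, and the `max_per_block` parameter are assumptions made for illustration. The first function is the textbook reservoir method (often called Algorithm R) for sampling from results as they are generated; the second draws one record by acceptance/rejection, picking a block and a slot uniformly and retrying when the slot is empty.

```python
import random
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Simple random sample of up to k items from a stream of unknown length."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def ar_sample_record(blocks: Sequence[Sequence[T]], max_per_block: int,
                     rng: random.Random) -> T:
    """One uniformly chosen record from variably filled blocks (hypothetical layout)."""
    if not any(blocks):
        raise ValueError("no records to sample")
    if max_per_block < max(len(b) for b in blocks):
        raise ValueError("max_per_block must bound every block's record count")
    while True:
        block = blocks[rng.randrange(len(blocks))]  # uniform over blocks
        slot = rng.randrange(max_per_block)         # uniform over possible slots
        if slot < len(block):                       # accept only occupied slots
            return block[slot]
        # Otherwise reject and retry; fuller blocks are accepted more often,
        # which cancels the bias of choosing blocks uniformly.
```

Each (block, slot) pair is proposed with the same probability, so every stored record is equally likely to be returned; the cost is the expected number of retries, which grows as the blocks become emptier relative to `max_per_block`.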
Related resources
Random Sampling from Database Files: A Survey
In this paper we survey known results on algorithms, data structures, and some applications of random sampling from databases. We first discuss various reasons for sampling from databases, and for inclusion of sampling as a DBMS operator. We consider basic sampling algorithms, sampling from trees, sampling from hash tables, and auxiliary memory resident index information to facilitate sampling.
Random Sampling from Databases - A Survey
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in construct-join are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss...
Simple Random Sampling from Relational Databases
Sampling is a fundamental operation for the auditing and statistical analysis of large databases. It is not well supported in existing relational database management systems. We discuss how to obtain samples from the results of relational queries without first performing the query. Specifically, we examine simple random sampling from selections, projections, joins, unions, and intersections. We...
Random Sampling from B+ Trees
We consider the design and analysis of algorithms to retrieve simple random samples from databases. Specifically, we examine simple random sampling from B+ tree files. Existing methods of sampling from B+ trees require the use of auxiliary rank information in the nodes of the tree. Such modified B+ tree files are called “ranked B+ trees”. We compare sampling from ranked B+ tree files, with new...
Efficient Ad-hoc Approximate Query Processing in Peer-to-Peer Databases
This paper appeared in the 22nd International Conference on Data Engineering (ICDE), Atlanta, Georgia, 2006. Abstract: Peer-to-peer databases are becoming prevalent on the Internet for distribution and sharing of documents, applications, and other digital media. The problem of answering large scale, ad-hoc analysis queries – e.g., aggregation queries – on these databases poses unique challenge...